Yield Prediction of Citrus Based on Soil and Climatic Variables

Weiping Wu 1,2

Research Insight

1 Hangzhou Yinghe Jiatian Technology Co., Ltd., Hangzhou, 310056, Zhejiang, China
2 Zhejiang Agronomist College, Hangzhou, 310021, Zhejiang, China

Computational Molecular Biology, 2026, Vol. 16, No. 1
Received: 26 Dec., 2025 Accepted: 31 Jan., 2026 Published: 12 Feb., 2026

This is an open access article published under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Accurate prediction of citrus yield is essential for optimizing agricultural management and ensuring food security. This study develops an integrated framework for citrus yield prediction based on soil and climate variables using multi-source data. Key soil properties and climatic factors are systematically analyzed to reveal their individual and interactive effects on yield formation. Both traditional statistical models and machine learning approaches, including Random Forest and Support Vector Machine, are employed and compared. Data preprocessing, feature selection, and model optimization strategies are implemented to improve prediction accuracy. A case study in a typical citrus-producing region demonstrates the applicability and robustness of the proposed approach. Results indicate that soil–climate coupling significantly enhances predictive performance, while key driving factors such as temperature, precipitation, and soil nutrient content play critical roles. The study provides valuable insights for precision agriculture and supports decision-making in citrus production under varying environmental conditions.

Keywords

Citrus yield prediction; Soil variables; Climate factors; Machine learning; Precision agriculture

1 Introduction

Citrus is one of the world’s major fruit crops, cultivated in more than 140 countries and contributing over 100 million tons annually to global production, with oranges alone accounting for more than half of this volume (Liu et al., 2012; Zhong and Nicolosi, 2020). The citrus industry underpins rural livelihoods, export earnings, and agro‑processing in major producing regions such as China, Brazil, the United States, the Mediterranean basin, and parts of Africa and South Asia. Growing international trade and demand for both fresh and processed citrus products have increased the need for stable, high yields and reliable supply forecasts to support marketing, storage, processing, and logistics decisions across the value chain. At the same time, climate change and soil degradation threaten the sustainability of citrus production systems, intensifying interest in quantitative tools that can anticipate yield variability and guide adaptive management (Wang et al., 2022).

Citrus yield reflects complex interactions between genotype, orchard management, soil conditions, and weather across key phenological stages (Díaz et al., 2017; Shafqat et al., 2021). Soil physical and chemical properties such as pH, salinity (e.g., sodium adsorption ratio, exchangeable sodium percentage), free CaCO₃, organic carbon, texture, and cation exchange capacity strongly influence root growth, water and nutrient availability, and ultimately fruit yield and quality. For example, higher soil pH, free CaCO₃ and sodicity have been associated with significant reductions in sweet orange yield and juice quality, whereas higher soil organic carbon and adequate cation exchange capacity support better yields and vitamin C content. Spatial variability in soil series, apparent electrical conductivity and elevation within groves also explains substantial within‑orchard yield differences and can be used to delineate productivity zones. In parallel, climatic variables, including temperature, precipitation, relative humidity, solar radiation, and heat units, affect floral induction, fruit set, fruit growth, and internal quality. Citrus development is optimized within relatively narrow temperature ranges; deviations, especially heat stress and water deficit, reduce CO₂ assimilation, impair fruit growth, increase acidity, and cause fruit drop. Recent empirical and scenario‑based studies have shown that mean temperature, diurnal temperature range, minimum and maximum temperatures, and humidity during critical growth and ripening periods have strong, often nonlinear, effects on citrus yield and quality, and will drive regional yield patterns under future climate change. These intertwined soil and climate mechanisms create substantial temporal and spatial yield variability that is difficult to capture using simple linear models.

In this context, accurate yield prediction models that explicitly integrate both soil and climatic variables have become crucial for precision citriculture and climate‑smart planning. Existing approaches have increasingly used statistical and machine learning techniques, including multiple linear regression, regression trees, support vector machines, and artificial neural networks, often combined with remote sensing indices or accumulated heat units, to forecast citrus yield at tree, block, or orchard scales. Many of these models, however, emphasize either climatic or management factors, or rely on spectral proxies, while soil properties are treated only indirectly or at coarse resolution. There remains a need for models that systematically couple detailed soil physicochemical indicators with key climatic drivers over multi‑year periods to improve robustness, interpretability, and generalization across sites and cultivars. The present paper addresses this gap by developing a citrus yield prediction framework based on jointly analyzed soil and climatic variables. The objectives are: (i) to quantify the relative contributions and interactions of major soil and weather factors controlling citrus yield; (ii) to construct and evaluate predictive models using these variables; and (iii) to discuss their potential application in site‑specific management, regional yield forecasting, and adaptation to climate change. The paper is structured as follows: Section 2 describes the study area, datasets, and soil and climate measurements; Section 3 outlines the modeling methods and evaluation metrics; Section 4 presents the results on factor importance and model performance; Section 5 discusses implications for citrus management under changing climatic conditions; and Section 6 summarizes the main conclusions and future research directions.

2 Research Progress on Soil and Climate Variables in Crop Yield Prediction

2.1 Review of citrus yield prediction methods at home and abroad

Citrus yield prediction has moved from manual visual estimation to sensor‑ and data‑driven methods that link orchard conditions with expected output. Remote sensing from satellites and UAVs is widely used to derive vegetation and water indices (e.g., NDVI, NDWI) that correlate canopy status with block‑ or parcel‑level citrus yield, often combined with multiyear field records (Moussaid et al., 2022; Suarez et al., 2023). Machine learning models such as support vector machines and orthonormal pursuit capture nonlinear relationships among spectral indices, management inputs, and yield, achieving high R² (≈0.85-0.88) and relatively low error for early‑season or pre‑harvest forecasts. Deep learning further improves citrus yield prediction by automatically extracting features from time‑series satellite indices and multi‑source field data. A deep neural network combining fertilization, irrigation, climate, and Sentinel‑2 NDVI/NDWI data obtained percentage errors near 10%, outperforming traditional methods at parcel scale. UAV imagery plus deep learning object detection and LSTM have also been used to detect, count, and size fruits at tree level, reducing estimation error compared with expert visual assessment and supporting fine‑grained yield management.

2.2 Application of soil factors in agricultural prediction models

Soil variables are key inputs in crop yield models, reflecting water and nutrient supply, physical structure, and biotic activity. Proximal sensing of soil electrical conductivity, moisture, slope, and chemistry, combined with machine learning regression, has explained large portions of yield variability in potatoes; soil moisture alone accounted for 57-66% of yield variation in some cases (Abbas et al., 2020). Similarly, sub‑field models using soil pH, organic matter, cation exchange capacity, and macronutrients with random forests achieved R² up to 0.85-0.94 for corn and soybean, highlighting the predictive power of detailed soil and topographic data (Burdett and Wellen, 2022). Beyond bulk properties, new work integrates vertical soil stratification and biological indicators into yield prediction. A partition‑prioritized ensemble framework that explicitly uses stratified soil layers (0-100 cm) improved maize yield prediction accuracy by at least 11.76%, demonstrating that deep soil horizons are as influential as topsoil. Machine learning models using soil organic carbon and specific bacterial biomarker communities also attained high accuracy (R² ≈0.81-0.82), with bacterial markers contributing over 40% of variable importance, indicating that soil microbiome structure can be an efficient proxy for soil fertility in predictive modeling.

2.3 Advances in climate variables for crop yield modeling

Climate variables (temperature, precipitation, radiation, humidity, and derived indices) are central to explaining interannual yield variability and climate‑change impacts. Integrating calibrated crop growth models (e.g., DSSAT) with climate scenarios and machine learning has clarified how precipitation and maximum temperature alternately dominate yield variability under different emission pathways, with higher accuracy in precipitation‑controlled periods (R² ≈0.81) (El-Mahroug et al., 2025). At larger scales, mixed‑effects meta‑models and process‑based simulations show substantial projected yield declines for major cereals under high‑emission scenarios, emphasizing the necessity of including temperature, rainfall, and CO₂ jointly to avoid underestimating losses. Advanced statistical and deep learning approaches increasingly treat climate as both a direct driver and a context for other predictors. Transparent climate‑only models using monthly vapor pressure deficit and precipitation splines have reached R² ≈0.79-0.85 for rainfed maize, and adding vegetation indices further improves performance. Multi‑model frameworks that explicitly separate and combine the effects of temperature and precipitation indicate that 1 °C warming can reduce global maize yield by about 5-7%, while increased rainfall partly offsets these losses, especially under wetter conditions, underscoring the importance of representing climate interactions in yield prediction (Figure 1) (Yin et al., 2022).

Figure 1 Conceptual framework illustrating how major climate variables (temperature, precipitation, radiation, and humidity) and derived indices influence crop yield through biophysical processes (Adopted from Yin et al., 2022)

Across crops, yield prediction has evolved toward integrated models that fuse soil properties, climatic drivers, and remote‑sensing information using machine learning and deep learning. For citrus, combining field management, climate, and spectral indices yields accurate parcel‑ and tree‑level forecasts. Soil factors, from stratified physicochemical layers to microbial biomarkers, substantially enhance prediction, while refined climate representations capture both mean changes and variability. These advances provide a methodological foundation for citrus yield prediction based on soil and climatic variables.

3 Theoretical Framework and Methodological System for Citrus Yield Prediction Based on Multi‑Source Data

3.1 Theoretical foundations of crop growth models and statistical modeling

Crop yield prediction has long relied on process‑based crop growth models that describe photosynthesis, respiration, biomass accumulation and partitioning as functions of radiation, temperature, water and nutrients. These models encode biophysical understanding of genotype-environment-management (G×E×M) interactions and allow simulation of yield responses under alternative climate and management scenarios (Cooper et al., 2020). By tracking resource capture and use efficiency (e.g., water, radiation, nutrients), crop growth models can attribute yield variation to specific limiting factors, providing a mechanistic benchmark for evaluating empirical models (Wang et al., 2022). Such frameworks are particularly useful for perennial crops like citrus, where long life cycles and complex phenology complicate purely empirical approaches.

However, process‑based models require detailed parameterization and often struggle with data scarcity or structural uncertainty, especially for fruit trees. To address these constraints, statistical yield models link observed yield directly with weather, soil and management variables using regression and related techniques. Linear and multiple linear regression, interaction regression and mixed models quantify marginal and interaction effects of factors such as temperature, rainfall and soil properties on yield (Ansarifar et al., 2021). Recent interaction regression models explicitly decompose yield into contributions from weather, soil and management, balancing prediction accuracy with interpretability and enabling agronomic insight into key drivers. For citrus, combining empirical regression with climate indices has already been shown to capture effects of monthly temperature and humidity on yield and quality across regions.
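As a minimal sketch of the interaction regression idea described above, the following Python example fits an ordinary least squares model with one temperature × rainfall interaction term. The function name, variable names, coefficients, and data are all hypothetical and synthetic; they are not taken from the cited studies and only illustrate how marginal and interaction effects appear as separate coefficients.

```python
import numpy as np

def fit_interaction_regression(temp, rain, soil_ph, yield_t_ha):
    """Ordinary least squares with one temperature x rainfall interaction term."""
    temp = np.asarray(temp, dtype=float)
    rain = np.asarray(rain, dtype=float)
    soil_ph = np.asarray(soil_ph, dtype=float)
    # Design matrix: intercept, three main effects, and the interaction column.
    X = np.column_stack([np.ones_like(temp), temp, rain, soil_ph, temp * rain])
    coef, *_ = np.linalg.lstsq(X, np.asarray(yield_t_ha, dtype=float), rcond=None)
    names = ["intercept", "temp", "rain", "soil_ph", "temp_x_rain"]
    return dict(zip(names, coef))

# Synthetic, noise-free data for illustration only (assumed values).
rng = np.random.default_rng(0)
temp = rng.uniform(15, 30, 200)        # seasonal mean temperature, deg C
rain = rng.uniform(0.4, 1.6, 200)      # seasonal rainfall, m
soil_ph = rng.uniform(5.0, 8.5, 200)
true_yield = 10 + 0.8 * temp + 6.0 * rain - 1.5 * soil_ph - 0.2 * temp * rain

coefs = fit_interaction_regression(temp, rain, soil_ph, true_yield)
for name, value in coefs.items():
    print(f"{name:>12s}: {value: .3f}")
```

A negative `temp_x_rain` coefficient in such a model would mean that the benefit of additional rainfall shrinks as temperature rises, which is the kind of agronomic reading that makes interaction models attractive.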

3.2 Principles of machine learning applications in agricultural prediction

With the expansion of environmental monitoring, remote sensing and IoT systems, machine learning (ML) has become central to crop yield prediction. ML techniques treat yield as an unknown, possibly nonlinear function of multi‑source inputs, including climate, soil, management and spectral indices (Iniyan et al., 2023). Tree‑based ensembles, support vector regression and gradient‑boosting methods can model complex interactions and non‑additive effects without explicit process equations, often outperforming traditional regression when sufficient data are available (Mahesh and Soundrapandiyan, 2024). Multi‑year case studies show that ML models can achieve very low normalized RMSE at regional scales, supporting operational yield forecasting and early‑season prediction.

Deep learning extends these principles by explicitly handling spatio‑temporal data. Convolutional and recurrent neural networks can ingest time series of weather, management and remotely sensed vegetation indices, capturing temporal dependencies and spatial heterogeneity (Khaki et al., 2019). Hybrid CNN-RNN architectures have achieved strong performance across large regions, while parcel‑scale citrus studies combining field variables with NDVI/NDWI images have demonstrated high accuracy before harvest. Nevertheless, ML and deep learning models face challenges of overfitting, feature redundancy and limited interpretability. Reviews emphasize the need for robust workflows, careful feature engineering, cross‑validation, and appropriate evaluation metrics (e.g., RMSE, R², MAE) to ensure generalizable and trustworthy predictions in agricultural settings (Shawon et al., 2024).

3.3 Construction of a soil-climate coupled prediction framework

A soil-climate coupled prediction framework for citrus yield integrates mechanistic insights from crop growth theory with data‑driven ML modeling on multi‑source datasets. Process understanding guides the selection of key soil attributes (e.g., texture, fertility indicators, water‑holding capacity) and climate descriptors (e.g., temperature, precipitation, humidity, heat units) that are known to influence citrus phenology, fruit set and quality (Wang et al., 2022). These variables can be augmented with derived indices (climate suitability, extremes, vegetation indices) and, where available, simulated outputs from crop or species distribution models to form physically meaningful features (Paudel et al., 2020; Catalano et al., 2025). Such a design helps reduce information leakage and ensures that predictors remain aligned with agronomic processes rather than purely statistical correlations.

On this basis, the methodological system can follow a modular workflow: data preprocessing and quality control; feature construction and selection; model training and hyperparameter optimization; and multi‑level evaluation (tree, parcel, regional scales). Ensemble ML models and deep neural networks can then be trained to map coupled soil-climate inputs to observed citrus yields, while interaction‑aware or hybrid data‑driven crop models provide complementary interpretability. Citrus‑specific applications already illustrate the value of combining climatic factors and accumulated heat units within neural networks, and of merging field data with satellite‑derived indices in deep learning architectures (Figure 2) (Almady et al., 2024; Moussaid et al., 2023). Extending these ideas, the proposed framework emphasizes generalization across orchards and seasons, and supports scenario analysis under future climate conditions, thereby offering a robust basis for precision management and strategic planning in citrus production.

Figure 2 Application of the soil-climate coupled framework for scenario analysis and precision management in citrus production (Adopted from Almady et al., 2024)

4 Data System Construction and Variable Processing Methods for Citrus Yield Prediction

4.1 Sources of soil and climate data and indicator system

Building a citrus yield prediction system based on soil-climate variables requires integrating multi‑source datasets at appropriate temporal and spatial scales. Climate information can be obtained from national meteorological services, reanalysis products, or gridded databases providing daily or monthly precipitation, air temperature, and relative humidity, which are commonly used as inputs to yield models. For citrus, such records allow derivation of yearly or seasonal averages and accumulated heat units, which have proven effective predictors when coupled with neural networks for citrus yield (Almady et al., 2024). In parallel, global and regional soil databases, as well as farm‑level surveys, can supply soil texture, pH, organic matter, nutrient content, and water‑holding capacity; these soil attributes explain substantial yield variability when combined with climate in machine‑learning models (Sihi et al., 2022).

The indicator system should reflect both data availability and physiological relevance. Typical crop yield prediction datasets include variables such as soil type, land slope, soil pH and irrigated area, alongside rainfall and temperature, as shown in recent hybrid feature‑selection frameworks (Gupta et al., 2022; Abdel-Salam et al., 2024). Large‑scale studies further demonstrate the importance of including soil hydraulic and nutrient parameters together with vegetation indices and land surface temperature to capture spatial yield patterns across regions. For citrus, a core indicator set can therefore be organized into climate (e.g., precipitation, mean/max/min temperature, humidity, heat units), soil (e.g., pH, texture, organic matter, N-P-K, water‑holding capacity), topography (elevation, slope), and management/area variables, ensuring compatibility with existing crop and citrus‑specific ANN models.

4.2 Data preprocessing and outlier correction methods

Heterogeneous soil and climate data must undergo rigorous preprocessing before model training. Common steps include unit harmonization, temporal aggregation (e.g., daily to seasonal), and normalization or standardization to place variables on comparable scales, which is routinely applied in SVR‑based and other ML yield frameworks (Gupta et al., 2022; Abdel-Salam et al., 2024). When integrating multi‑source satellite and environmental datasets, careful spatial resampling and masking are needed so that climate, soil, and vegetation indices correspond to the same grid cells or orchard parcels. Crop yield datasets collected from government portals and IoT systems are often merged by keys such as region, year, and crop, followed by filtering to remove inconsistent or duplicated entries.
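The preprocessing steps above (temporal aggregation, key-based merging, and standardization) can be sketched in plain Python as follows. The record layout, field names such as `parcel` and `temp_c`, and all values are assumptions made for illustration; a real pipeline would more likely use a dataframe library and handle many variables at once.

```python
from collections import defaultdict
from statistics import mean, pstdev

def seasonal_means(daily_records, season_months):
    """Average daily temperature per (parcel, year) over the given months."""
    buckets = defaultdict(list)
    for rec in daily_records:
        if rec["month"] in season_months:
            buckets[(rec["parcel"], rec["year"])].append(rec["temp_c"])
    return {key: mean(vals) for key, vals in buckets.items()}

def merge_by_key(features, yields):
    """Inner join on (parcel, year); keys missing from either side are dropped."""
    return {k: (features[k], yields[k]) for k in features if k in yields}

def standardize(values):
    """Z-score scaling; returns zeros if the variable is constant."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

# Hypothetical records, for illustration only.
daily = [
    {"parcel": "A", "year": 2020, "month": 6, "temp_c": 24.0},
    {"parcel": "A", "year": 2020, "month": 7, "temp_c": 28.0},
    {"parcel": "A", "year": 2020, "month": 12, "temp_c": 9.0},  # outside the season
    {"parcel": "B", "year": 2020, "month": 6, "temp_c": 22.0},
    {"parcel": "B", "year": 2020, "month": 8, "temp_c": 26.0},
    {"parcel": "C", "year": 2020, "month": 7, "temp_c": 30.0},  # no yield record
]
yields_t_ha = {("A", 2020): 31.5, ("B", 2020): 27.0}

summer = seasonal_means(daily, season_months={6, 7, 8})
merged = merge_by_key(summer, yields_t_ha)
scaled = standardize([feature for feature, _ in merged.values()])
print(merged)
print(scaled)
```

The inner join silently discards parcel C because it has no yield record; in practice it is worth logging how many rows each merge removes, since this is where inconsistent keys tend to surface.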

Missing data and outliers are particularly critical for long climate records and sparse soil surveys. Comparative studies on sunflower yield prediction show that naive deletion of missing values degrades regression performance, whereas imputation methods, especially random‑forest‑based prediction, can raise R² from about 0.72 to 0.94 (Călin et al., 2023). The same work demonstrates that unsupervised outlier detection algorithms such as one‑class SVM further improve accuracy when used prior to model fitting. Other research evaluates multiple outlier detection methods (e.g., isolation forest, elliptic envelope, one‑class SVM) and finds that appropriate removal of anomalous samples substantially refines yield prediction in regional agricultural datasets (Anu et al., 2025). Together, these findings suggest that robust imputation combined with data‑driven outlier screening is essential for reliable citrus yield modeling.
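One possible way to combine random-forest-based imputation with unsupervised outlier screening is sketched below using scikit-learn. This is not the procedure of the cited studies: the data are synthetic, the variable names are hypothetical, and the choices of `IterativeImputer` with a random forest estimator, an isolation forest detector, and a 1% contamination rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Synthetic data: soil nitrogen loosely tied to temperature (assumed relation).
rng = np.random.default_rng(42)
n = 200
temp = rng.normal(22.0, 2.0, n)
soil_n = 1.5 + 0.05 * temp + rng.normal(0.0, 0.05, n)
X = np.column_stack([temp, soil_n])

# Knock out 10% of the soil measurements to mimic a sparse soil survey.
X_missing = X.copy()
missing_rows = rng.choice(n, size=20, replace=False)
X_missing[missing_rows, 1] = np.nan

# Random-forest-based imputation: each incomplete column is predicted from the others.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X_missing)

# Append one implausible record, then screen with an isolation forest.
X_with_outlier = np.vstack([X_filled, [60.0, 9.0]])
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(X_with_outlier)   # -1 marks suspected outliers
X_clean = X_with_outlier[flags == 1]
print(X_with_outlier.shape[0] - X_clean.shape[0], "rows flagged")
```

The contamination rate fixes the share of records that will be flagged, so it should be set from domain knowledge about sensor and survey error rates rather than tuned to maximize model fit.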

4.3 Feature engineering and key variable selection

Feature engineering transforms raw soil and climate measurements into informative predictors that better capture crop responses. For climate, this includes calculating growing‑season averages, extremes, and agroclimatic indices such as growing degree days and heat‑use efficiency, which have been successfully linked to citrus yield using ANN models (Almady et al., 2024). Similarly, agrometeorological models for sugarcane derive cumulative degree days and soil water storage during distinct growth phases, then use stepwise regression to identify which phases most affect yield (Viana et al., 2023). At larger scales, combining environmental descriptors (e.g., soil hydraulic properties, topography) with multiple satellite‑derived variables such as vegetation indices, solar‑induced fluorescence, land‑surface temperature, and microwave vegetation optical depth has been shown to markedly improve maize, rice, and soybean yield prediction.
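Accumulated heat units of the kind mentioned above are commonly computed with the simple averaging method, GDD = Σ max(0, (Tmax + Tmin)/2 − Tbase). The sketch below assumes a base temperature of 13 °C and an optional upper cap purely for illustration; the appropriate thresholds depend on cultivar and region and are not specified in the text.

```python
def growing_degree_days(tmax, tmin, base_temp=13.0, upper_temp=None):
    """Accumulate daily heat units: max(0, (Tmax + Tmin) / 2 - Tbase)."""
    total = 0.0
    for hi, lo in zip(tmax, tmin):
        if upper_temp is not None:
            hi = min(hi, upper_temp)   # optional cap applied to very hot days
        total += max(0.0, (hi + lo) / 2.0 - base_temp)
    return total

# Four hypothetical days of daily maximum and minimum temperature (deg C).
tmax = [20.0, 25.0, 30.0, 12.0]
tmin = [10.0, 15.0, 20.0, 6.0]

gdd = growing_degree_days(tmax, tmin)
gdd_capped = growing_degree_days(tmax, tmin, upper_temp=28.0)
print(gdd, gdd_capped)   # 21.0 20.0
```

The fourth day contributes nothing because its mean temperature (9 °C) is below the assumed base, which is why GDD separates cool and warm seasons better than a plain seasonal mean.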

Given the high dimensionality of multi‑source datasets, feature selection and dimension reduction are crucial for avoiding overfitting and improving efficiency. Hybrid frameworks frequently apply correlation‑based filters and clustering to remove redundant or weakly related variables before advanced selection. Subsequent algorithms such as Relief, recursive feature elimination, or combined selection-extraction schemes (FSX) substantially enhance ML performance, with FSX‑based models improving RMSE by up to 60% relative to using all features (Gupta et al., 2022; Pham et al., 2022). Systematic evaluations also show that carefully chosen subsets of soil, climate, and management variables allow random forests, SVMs, and neural networks to reach high predictive accuracy in crop yield tasks. For citrus, adopting such hybrid feature‑engineering strategies on soil-climate-remote‑sensing inputs can highlight the most influential indicators and support both accurate prediction and agronomic interpretability.
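The correlation-based filtering stage described above can be sketched as a greedy redundancy filter. This covers only the first, filter-type step, not Relief, recursive feature elimination, or FSX; the 0.9 threshold, the feature names, and the synthetic data are assumptions.

```python
import numpy as np

def drop_redundant_features(X, names, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute Pearson correlation
    with every previously kept feature is below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic predictors; max_temp is built to nearly duplicate mean_temp.
rng = np.random.default_rng(1)
n = 300
mean_temp = rng.normal(22, 3, n)
max_temp = mean_temp + 5 + rng.normal(0, 0.3, n)
rainfall = rng.normal(900, 150, n)
soil_ph = rng.normal(6.5, 0.6, n)

X = np.column_stack([mean_temp, max_temp, rainfall, soil_ph])
names = ["mean_temp", "max_temp", "rainfall", "soil_ph"]
selected = drop_redundant_features(X, names)
print(selected)   # ['mean_temp', 'rainfall', 'soil_ph']
```

Because the filter is greedy, the column order decides which member of a correlated pair survives; ordering features by agronomic priority before filtering keeps the more interpretable one.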

5 Multi-Model Construction and Optimization Methods for Citrus Yield Prediction

5.1 Development and applicability analysis of traditional statistical models

Traditional statistical models, such as multiple linear regression and time‑series approaches, offer an interpretable starting point for citrus yield prediction. Linear and multiple regression can directly relate yield to soil and climate variables, providing explicit coefficients that quantify marginal effects and interactions (Kumar et al., 2022). Time‑series models like ARIMA and ARIMAX extend this idea by capturing temporal dependence in yield, sometimes including exogenous variables such as weather or irrigation to improve forecasts (Pandit et al., 2023). These methods have been widely applied in multi‑crop settings, confirming their ability to deliver reasonable accuracy when relationships are approximately linear and data volumes are moderate.

However, the applicability of these models is constrained when yield responses become nonlinear or when high‑dimensional soil-climate inputs are considered. Comparative studies show that decomposition or ARIMA models can fit certain crops well, yet their performance is often surpassed by more flexible approaches once complex weather or pollutant variables are introduced. Hybrid time‑series structures that embed ARIMA inside broader frameworks (e.g., ARIMAX with exogenous variables) partially alleviate this issue but still require strong assumptions about residual structure and stationarity (Pandit et al., 2023). For citrus, with pronounced inter‑annual variability and intricate responses to water and temperature stress, these limitations motivate a shift toward machine learning models while still using traditional methods as baselines and for interpretability.

5.2 Construction of machine learning models

Machine learning models such as Random Forest (RF) and Support Vector Machines (SVM/SVR) provide flexible tools for modeling nonlinear, high‑order interactions among soil, climate, and management variables. RF aggregates many decision trees to reduce variance and has repeatedly emerged as one of the top performers for crop yield prediction, achieving high R² and low error across diverse environments (Shawon et al., 2024). In citrus, RF and SVM have been successfully applied to UAV‑derived spectral and texture features, where RF provided the best fruit quality predictions and SVM delivered competitive performance for fruit number (Figure 3) (Xu et al., 2025). These results highlight RF’s robustness to noisy multi‑source inputs and SVM’s strength in handling complex but relatively low‑dimensional feature sets.

Figure 3 Application of RF and SVM models using UAV-derived features for citrus fruit quality and yield prediction (Adopted from Xu et al., 2025)

Broader comparative analyses confirm that RF, gradient boosting and related ensembles frequently outperform simpler models such as linear regression or plain SVM in both accuracy and generalization. Support Vector Regression remains attractive in settings with limited samples and carefully engineered explanatory variables, and is often used as a benchmark against which deep learning and ensemble methods are evaluated (Kumar et al., 2023). For citrus yield prediction, constructing RF and SVM models typically involves careful feature selection (soil physicochemical indices, climate statistics, vegetation indices), tuning of depth, tree number, kernel type and regularization, and systematic comparison using metrics such as RMSE, MAE and R² to match model complexity with data availability (Shawon et al., 2024).
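The tuning and comparison procedure outlined above might look like the following scikit-learn sketch, which grid-searches tree number and depth for a random forest, and kernel type and regularization for SVR, then reports RMSE, MAE and R² on a held-out set. The synthetic yield function, the parameter grids, and the variable names are illustrative assumptions rather than settings from the cited studies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: temperature optimum near 24 deg C, saturating rainfall response.
rng = np.random.default_rng(7)
n = 400
temp = rng.uniform(14, 32, n)
rain = rng.uniform(400, 1600, n)
soil_oc = rng.uniform(0.5, 3.0, n)
y = (30 - 0.15 * (temp - 24) ** 2 + 8 * (1 - np.exp(-rain / 600))
     + 2 * soil_oc + rng.normal(0, 1, n))
X = np.column_stack([temp, rain, soil_oc])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scoring = "neg_root_mean_squared_error"
searches = {
    "RF": GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 200], "max_depth": [None, 8]},
        cv=3, scoring=scoring),
    # SVR is scale-sensitive, so inputs are standardized inside the pipeline.
    "SVR": GridSearchCV(
        make_pipeline(StandardScaler(), SVR()),
        {"svr__kernel": ["rbf", "linear"], "svr__C": [1, 10, 100]},
        cv=3, scoring=scoring),
}

results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    results[name] = {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
        "R2": float(r2_score(y_test, pred)),
    }
    print(name, search.best_params_, results[name])
```

Placing the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into the cross-validated scores.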

5.3 Model integration and parameter optimization strategies

Recent work shows that integrating multiple models through ensemble learning can substantially improve yield prediction reliability. Stacking frameworks that combine linear regression, tree‑based models (RF, XGBoost, LightGBM) and other regressors via a meta‑learner have achieved R² scores up to about 0.98, clearly surpassing any single constituent model. Similar stacking approaches using RF, XGBoost, decision trees and K‑nearest neighbors, sometimes augmented by synthetic data, further enhance stability and accuracy, with optimized ensembles reaching R² values above 0.99 and very low MAE (Waqar et al., 2025). In citrus applications, a particle swarm optimization variant (CPSO) coupled with XGB and SVM has been used to simultaneously optimize feature subsets and model parameters, leading to marked gains in accuracy and reduced input dimensionality (Xu et al., 2025).
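A stacking ensemble of the general kind described above can be sketched with scikit-learn's `StackingRegressor`. To keep the example dependency-free beyond scikit-learn, `GradientBoostingRegressor` stands in for XGBoost and LightGBM, and a ridge regression serves as the meta-learner; both substitutions, along with the synthetic data, are assumptions and do not reproduce the cited configurations.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic soil-climate data (assumed response, for illustration only).
rng = np.random.default_rng(11)
n = 300
temp = rng.uniform(14, 32, n)
rain = rng.uniform(400, 1600, n)
soil_oc = rng.uniform(0.5, 3.0, n)
y = 30 - 0.15 * (temp - 24) ** 2 + 0.004 * rain + 2 * soil_oc + rng.normal(0, 1, n)
X = np.column_stack([temp, rain, soil_oc])

# Base learners produce out-of-fold predictions (cv=5) that train the meta-learner.
stack = StackingRegressor(
    estimators=[
        ("lin", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)

scores = cross_val_score(stack, X, y, cv=3, scoring="r2")
print("Stacked R2 per fold:", np.round(scores, 3))
```

The outer cross-validation matters here: scores reported on the same folds used to fit the meta-learner tend to be optimistic, which is one reason very high R² values from stacked models deserve a careful look at the validation design.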

Hyperparameter optimization and automated search are central to these improvements. Studies employing Optuna to tune tree‑based ensembles across many regressors (Gradient Boosting, XGBoost, LightGBM, RF, Bagging, KNN) show that optimized gradient boosting can reach near‑perfect R² and minimal RMSE, far exceeding untuned baselines (Jayanthi et al., 2025). Deep learning models for yield forecasting likewise benefit from systematic tuning of learning rate, optimizer, and architecture, with optimized Bi‑LSTM models significantly reducing prediction errors compared with both simple LSTM and traditional ML models. For a citrus soil-climate framework, integrating RF, SVM, gradient boosting and possibly deep architectures within stacked or hybrid ensembles, guided by automated hyperparameter search and feature importance analysis (e.g., SHAP), offers a powerful route to robust, high‑precision yield prediction.

6 Performance Comparison of Different Models and Analysis of Driving Factors in Citrus Yield Prediction

6.1 Construction of prediction accuracy evaluation metrics

To compare citrus yield prediction models fairly, a set of standard error- and correlation-based metrics is required. Commonly used measures include root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R²), which together describe both average deviation and goodness of fit between observed and predicted yields (Almady et al., 2024). Crop‑model evaluation studies emphasize that RMSE and MAE capture different aspects of average error and recommend combining them with efficiency‑type indices or regression analysis for a robust assessment of model performance (Yang et al., 2014).

There is ongoing discussion about when RMSE or MAE is more appropriate. Some work argues that MAE is a more natural and unambiguous indicator of average model error, while RMSE is more sensitive to large deviations and therefore better reflects penalty for outliers. Later analyses note that RMSE is meaningful when errors are approximately Gaussian and satisfies properties of a distance metric, suggesting that both RMSE and MAE should be reported for model intercomparison (Xu et al., 2025). In recent citrus ANN studies, R², RMSE, MAE, and MAPE are jointly used to evaluate prediction quality in both training and testing phases, illustrating how a multi‑metric scheme can characterize accuracy, robustness, and generalization ability.
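The difference in outlier sensitivity between MAE and RMSE can be made concrete with a small worked example. The sketch below computes the four metrics from their standard definitions; the observed and predicted yields are invented numbers chosen so that both prediction sets have the same MAE.

```python
import math

def regression_metrics(observed, predicted):
    """Return RMSE, MAE, MAPE (%) and R2 for paired observations and predictions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": 1.0 - ss_res / ss_tot}

observed = [20.0, 25.0, 30.0, 35.0]   # hypothetical yields, t/ha
even = [22.0, 27.0, 32.0, 37.0]       # four errors of 2 t/ha each
spiky = [20.0, 25.0, 30.0, 43.0]      # a single error of 8 t/ha

m_even = regression_metrics(observed, even)
m_spiky = regression_metrics(observed, spiky)
print(m_even["MAE"], m_even["RMSE"])     # 2.0 2.0
print(m_spiky["MAE"], m_spiky["RMSE"])   # 2.0 4.0
```

Both prediction sets have an MAE of 2 t/ha, but concentrating the error in one orchard-year doubles the RMSE, which is why reporting the two together reveals whether a model fails evenly or in rare large misses.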

6.2 Comparative analysis of multi‑model prediction performance

Across crop yield applications, machine learning models often outperform traditional linear regression when relationships between yield, climate, and soil are nonlinear. For citrus, an ANN using weather factors and accumulated heat units achieved R² of 0.87 in training and 0.83 in testing, with low RMSE and MAE, clearly exceeding multiple linear regression (MLR) benchmarks whose R² ranged from 0.151 to 0.844 depending on cultivar (Almady et al., 2024). The same study showed that the ANN also outperformed data‑mining algorithms such as K‑nearest neighbor, KStar, and support vector regression, indicating that nonlinear architectures can better capture citrus responses to combined climatic drivers.

Comparable patterns appear in more general crop yield studies. Random Forest models trained on climate and biophysical variables produced RMSE values of only 6-14% of mean observed yield, whereas MLR errors ranged from 14% to 49%, demonstrating the advantage of ensemble trees at regional and global scales (Jeong et al., 2016). Gradient‑boosting methods such as CatBoost, LightGBM, and XGBoost also achieve very high R² (up to about 0.99) and low RMSE when predicting rice yield from weather and management inputs, confirming the strong capacity of boosting algorithms for yield prediction tasks. In citrus remote‑sensing applications, XGBoost and Random Forest generally outperform Gaussian process regression and stepwise regression, while stacked or meta‑heuristic‑optimized ensembles can further reduce error and improve stability over single learners (Xu et al., 2025).

6.3 Identification of key soil and climate driving factors

Model‑agnostic interpretation tools and sensitivity analyses are increasingly used to identify key soil and climate drivers behind yield predictions. For citrus under Egyptian conditions, ANN sensitivity results indicated that air relative humidity contributed the largest share (about 19.3%) to yield variation, highlighting the importance of moisture‑related atmospheric conditions for fruit production. In China, empirical regression linking citrus yield to climate showed that mean temperature in October and minimum temperature in November positively affected yield, while maximum temperature in September and relative humidity in October had negative impacts, underscoring the stage‑specific influence of thermal and humidity regimes during growth and ripening (Wang et al., 2022).
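As an illustration of a model‑agnostic sensitivity tool, the sketch below implements permutation importance: each input is shuffled in turn and the resulting increase in MAE is recorded. This is a generic technique named here for illustration, not the sensitivity procedure of the cited ANN study. The `toy_model` function, its coefficients, and the feature names are hypothetical stand‑ins for a fitted yield model, and the observations are generated from the same toy rule so that the baseline error is zero.

```python
import random


def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_abs_error(y_true, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Replace one column with its shuffled copy, keep the rest intact.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            error = mean_abs_error(y_true, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


def toy_model(row):
    # Hypothetical stand-in for a fitted yield model; it ignores "elevation".
    return 2.0 * row["temp_oct"] - 0.1 * row["humidity_oct"]


rows = [
    {
        "temp_oct": 18.0 + i,
        "humidity_oct": 60.0 + 3 * ((i * 3) % 8),
        "elevation": 100.0 * i,
    }
    for i in range(8)
]
y = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, y)
print(importances)
```

A feature the model never uses receives an importance of exactly zero, whereas features that drive the prediction show a positive increase in error when shuffled.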

Explainable machine learning at broader scales reveals consistent patterns regarding climate-soil controls on yield. A random‑forest-LIME framework applied across the conterminous United States identified growing degree days as the dominant climatic factor for major crops, with soil water‑holding capacity emerging as a key edaphic property modulating yield responses. For citrus fruit yield estimated from UAV multispectral data, SHAP analysis showed that vegetation indices such as the normalized difference chlorophyll index and red‑band reflectance features strongly shaped model outputs, indirectly capturing canopy nitrogen status and vigor that are themselves conditioned by soil fertility and water supply (Xu et al., 2025). Together, these results suggest that temperature‑related indices, humidity and water availability, and soil hydraulic and fertility characteristics form the core set of driving factors in soil-climate coupled citrus yield prediction models.

7 Case Study: Application of Soil-Climate Driven Citrus Yield Prediction in a Typical Citrus-Producing Region

7.1 Analysis of soil and climate characteristics in the study area

In many major citrus‑producing regions, soils show marked spatial variability in fertility that strongly conditions yield potential. Surveys in typical orchards have revealed slightly alkaline, non‑saline soils with low organic matter and widespread deficiencies of nitrogen, phosphorus, and key micronutrients such as Fe and Zn across multiple depths (Figure 4) (Ahmad et al., 2022). Under such conditions, citrus leaves frequently exhibit nutrient deficiencies, indicating that soil fertilization alone is insufficient and that site‑specific nutrient management strategies, including foliar applications, are required to sustain productivity. At the same time, spatial analyses of soil nutrients using spatiotemporal kriging demonstrate clear patterns of evolving suitability, with some townships constrained by soil acidification or low available N and P, and others showing improving fertility trends (Wu et al., 2022).

Figure 4 Spatial distribution of soil fertility indicators (e.g., nitrogen, phosphorus, organic matter) across citrus-growing regions, highlighting heterogeneity in soil conditions (Adopted from Ahmad et al., 2022)

Climatic conditions in these regions are typically warm and humid, with distinct growing and ripening seasons that define the timing and magnitude of citrus yield responses. Empirical regression models in China have shown that monthly mean temperature and humidity during late growth and ripening stages are the dominant climatic determinants of citrus yield, while diurnal temperature range exerts a major influence on quality attributes. Projections based on CMIP6 scenarios further suggest that, in many provinces, future changes in temperature and humidity may actually improve climatic suitability for citrus growth, leading to increased yields even as quality responds heterogeneously across regions (Wang et al., 2022). Together, these soil and climate characteristics provide a heterogeneous but quantifiable environmental backdrop for testing soil-climate driven yield prediction models.

7.2 Application results of prediction models in actual production areas

Applying soil-climate driven models in real citrus orchards has demonstrated that integrating multi‑source data can substantially enhance prediction accuracy. In a Moroccan orchard comprising 50 parcels monitored over five years, machine learning models trained on climate, irrigation, fertilization doses, and phytosanitary treatments, together with NDVI and NDWI extracted from Sentinel‑2 and Landsat imagery, achieved promising mean absolute and squared errors at parcel scale (Moussaid et al., 2022). A subsequent deep learning study on the same orchard showed that a multi‑layer neural network using parcel‑level management and climatic data, coupled with NDVI/NDWI images before harvest, could reduce mean absolute error to about 0.145 t/ha and percentage error to 10%, indicating a strong capacity to capture intra‑orchard variation (Moussaid et al., 2023).

In larger commercial orchards, early‑season forecasting systems using time‑series Landsat vegetation indices and historical block yields have been able to explain high proportions of yield variability across farms, varieties, and years. For 315 Australian citrus blocks in three regions, support vector machines using bimonthly vegetation index time series and block‑yield history achieved R² ≈ 0.88, with RMSE around 15.5 t/ha and forecasts available up to nine months before harvest (Suarez et al., 2023). Climate‑driven neural network models in Egypt, based on precipitation, temperature, relative humidity, and accumulated heat units, have also yielded low mean absolute percentage errors (≈5.4%) for multiple citrus cultivars, outperforming multiple linear regression and several data‑mining algorithms. These case studies collectively illustrate that soil-climate-remote‑sensing‑based approaches can provide operationally useful forecasts in diverse production environments.

7.3 Model applicability and regional differences analysis

Regional differences in soil fertility, climate, and management strongly influence how well a given prediction model transfers between areas. Studies on environmental suitability in Chinese citrus‑producing counties show that the proportion of “moderately suitable” and “suitable” orchards (defined by combined soil nutrients, topography, and climate) is positively correlated with annual yield, while low‑suitability zones are constrained by specific problems such as soil acidification or low available N and P (Wu et al., 2022). Similarly, detailed soil and leaf surveys in Pakistan highlight that citrus orchards on slightly alkaline, calcareous soils with low organic matter and multi‑nutrient deficiencies require tailored nutrient strategies, meaning that yield models calibrated in more fertile regions may systematically overestimate yields if directly transferred without adjustment.

Domain adaptation research for crop yield prediction indicates that differences in agro‑ecological zones, especially in growing degree days and vapor pressure deficit, can create “domain shifts” that degrade model performance when training and application regions differ (Priyatikanto et al., 2023). Partial domain adaptation and related approaches that selectively align feature distributions between source and target regions have been shown to improve the transferability of deep learning yield models, reducing negative transfer when label (yield) ranges do not fully overlap (Yuchi et al., 2023). For citrus, this implies that soil-climate yield models developed in a typical region can be extended to other production zones by explicitly accounting for regional differences in climate regimes and soil fertility classes, or by employing domain‑adaptation techniques to recalibrate feature-yield relationships, thereby enhancing robustness across heterogeneous citrus belts.
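Before transferring a model, a simple diagnostic can indicate whether such a domain shift is present. The sketch below computes a standardized mean difference between source and target regions for each feature and flags those beyond a threshold. It is only a screening step, not an implementation of partial domain adaptation; the function names, the threshold of 0.5, and all feature values are assumptions made for illustration.

```python
import statistics


def standardized_shift(source, target):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = (statistics.variance(source) + statistics.variance(target)) / 2.0
    return (statistics.mean(target) - statistics.mean(source)) / pooled_var ** 0.5


def flag_shifted_features(source_region, target_region, threshold=0.5):
    """Return features whose absolute standardized shift exceeds threshold."""
    flagged = {}
    for name in source_region:
        shift = standardized_shift(source_region[name], target_region[name])
        if abs(shift) > threshold:
            flagged[name] = shift
    return flagged


# Hypothetical seasonal values: growing degree days and vapor pressure deficit (kPa).
source_region = {
    "gdd": [2800.0, 2900.0, 3000.0, 3100.0, 3200.0],
    "vpd": [1.0, 1.1, 1.2, 1.3, 1.4],
}
target_region = {
    "gdd": [3300.0, 3400.0, 3500.0, 3600.0, 3700.0],
    "vpd": [1.05, 1.1, 1.2, 1.3, 1.35],
}
flagged = flag_shifted_features(source_region, target_region)
print(flagged)
```

In this toy case only growing degree days are flagged, suggesting that recalibration or adaptation effort should focus on the thermal regime rather than on vapor pressure deficit.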

8 Uncertainty Effects of Soil-Climate Coupling on Citrus Yield Prediction

8.1 Analysis of soil-climate interaction effects

Soil-climate coupling shapes both the level and stability of crop yields, and thus strongly affects uncertainty in citrus yield prediction. Global simulations show that, under low fertilizer conditions, variability caused by soil type often exceeds interannual weather‑driven variability, because soil nutrient supply and hydrologic properties dominate yield responses in many regions (Folberth et al., 2016). When irrigation and nutrients are ample, soil‑ and climate‑driven variabilities become similar, implying that good management can partly buffer soil heterogeneity. Large multi‑model ensembles further indicate that climate and crop models contribute roughly equally to uncertainty in future yield projections, with soil and nutrient processes remaining a major unresolved component in many frameworks.
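The comparison between soil‑driven and weather‑driven variability can be made concrete with a small sketch. Given yields for several soil types over the same years, it contrasts the variance of soil means (averaged over years) with the variance of year means (averaged over soils). This is a simplified illustration under assumed data, not the decomposition used in the cited global simulations; the soil labels and yield values are hypothetical.

```python
import statistics


def soil_vs_weather_variability(yields):
    """Compare between-soil and between-year variance of mean yields.

    yields maps a soil type to a list of yearly yields; every soil type
    must cover the same years in the same order.
    """
    soil_means = [statistics.mean(series) for series in yields.values()]
    year_means = [statistics.mean(column) for column in zip(*yields.values())]
    return statistics.pvariance(soil_means), statistics.pvariance(year_means)


# Hypothetical low-input yields in t/ha for three soils over three years.
yields = {
    "sandy": [10.0, 12.0, 11.0],
    "loam": [20.0, 22.0, 21.0],
    "clay": [15.0, 17.0, 16.0],
}
soil_var, weather_var = soil_vs_weather_variability(yields)
print(soil_var, weather_var)
```

In this toy data set the between‑soil variance is far larger than the between‑year variance, mirroring the low‑input situation described above.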

Soil properties also condition how crops respond to climate anomalies, altering both impacts and prediction reliability. Analyses across US crops reveal that yields on coarse‑textured soils are more sensitive to precipitation and temperature variability, whereas high soil organic carbon (>2%) reduces sensitivity by increasing water retention and buffering heat‑related water loss (Huang et al., 2021). In China, high‑quality soils similarly dampen yield responses to climate variability, increasing average yields by about 10% and reducing interannual variability by roughly 16%, thereby improving resilience under climate change. These findings imply that citrus orchards on degraded or coarse soils will exhibit stronger and less predictable responses to heat and drought extremes, amplifying uncertainty in soil-climate coupled models.

8.2 Sources of model error and uncertainty assessment

Uncertainty in soil-climate based yield prediction arises from data, model structure, and future climate forcing. Bayesian neural network yield models combining satellite, climate, soil, and historical data show that predictive uncertainty is mainly driven by observation noise and interannual environmental stress (heat, water deficits), with uncertainty decreasing as more in‑season information accumulates. Coupled statistical-physical frameworks demonstrate that stochastic variability in meteorological drivers propagates through crop models, and global sensitivity analysis can identify which climate variables dominate prediction uncertainty under specific scenarios (Chrispell et al., 2021).

At larger scales, process‑based ensembles reveal that uncertainty in management practices and soil inputs can exceed that from meteorological forcing, especially when fertilizer and irrigation are poorly constrained (Dokoohaki et al., 2021). Other intercomparison studies find that climate and crop model structures each contribute substantially to projection spread, with crop models often dominating in the mid‑century and climate scenarios becoming more important later in the century. Machine‑learning and ensemble approaches that integrate process‑based outputs with environmental covariates can reduce yield projection uncertainty by 30-70%, but residual dependence on global climate models and emission scenarios remains a major source of error in long‑term assessments.

8.3 Implications for agricultural management decision-making

Quantified uncertainty from soil-climate coupled models is crucial for risk‑aware citrus management. Probabilistic frameworks using Bayesian neural networks or hierarchical models provide prediction intervals in which 80-95% of observed yields are captured, allowing managers to assess downside risk and adjust practices accordingly (Bazrafshan et al., 2022). Seasonal forecasting studies show that as real‑time climate information replaces forecasts during the growing season, the spread of yield predictions shrinks, improving confidence in late‑season decisions such as harvest scheduling and short‑term irrigation planning.
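Whether prediction intervals actually capture the stated share of observations can be checked empirically. The following sketch computes interval coverage and mean interval width from paired observations and bounds; the function names and all values are hypothetical and serve only to illustrate the check.

```python
def interval_coverage(observed, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return inside / len(observed)


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals (sharpness)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Hypothetical block yields (t/ha) with lower and upper interval bounds.
observed = [22.0, 30.0, 27.0, 35.0, 41.0]
lower = [20.0, 26.0, 28.0, 30.0, 36.0]
upper = [26.0, 32.0, 33.0, 38.0, 44.0]

coverage = interval_coverage(observed, lower, upper)
width = mean_interval_width(lower, upper)
print(coverage, width)
```

Coverage close to the nominal level combined with narrow intervals indicates forecasts that are both reliable and informative for risk‑aware decisions.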

Soil-climate analyses also offer strategic guidance on long‑term adaptation. Evidence that high‑quality soils raise mean yield and reduce climate‑driven variability suggests that investments in soil organic matter, structure, and nutrient status can both increase citrus productivity and lower yield risk under warming scenarios (Huang et al., 2021; Qiao et al., 2022). Large‑ensemble impact studies highlight that management and soil inputs are leverage points for reducing predictive uncertainty and real yield losses, implying that improving soil data, refining irrigation-fertilizer strategies, and tailoring cultivars to local soils can make citrus systems more resilient despite unavoidable climate and model uncertainties.

9 Conclusions and Future Directions for Citrus Yield Prediction Based on Soil and Climate Variables

Current research shows that citrus yield can be predicted with high accuracy when key soil and climate variables are combined with appropriate modeling techniques. Temperature, rainfall, humidity, and soil properties such as moisture and fertility consistently emerge as primary determinants of crop yield, particularly when augmented with vegetation indices like NDVI or NDWI that capture canopy status. In citrus, artificial neural networks using climatic variables and accumulated heat units have achieved low RMSE and MAPE and clearly outperformed multiple linear regression and several data‑mining algorithms, confirming the value of nonlinear models tailored to regional conditions. Across crops and especially for tree crops, machine learning and deep learning models have achieved accuracies ranging from roughly 50% to 99%, depending on dataset size, feature richness, and model complexity. Deep neural networks that integrate field management, soil-climate data, and satellite‑derived indices have demonstrated strong performance in citrus orchards, with very low mean absolute errors at parcel scale. At broader regional scales, support vector machines and ensemble methods using time‑series vegetation indices have also provided reliable early‑season forecasts for citrus block yields, highlighting that soil-climate driven models can be adapted from local parcel applications to larger production regions.

From a production standpoint, accurate soil-climate based yield prediction supports precision agriculture by enabling site‑specific management of inputs and operations. Reviews emphasize that combining sensor data, remote sensing and ML/DL models allows farmers to better understand the combined effects of water deficits, nutrient status and other stresses, thereby optimizing yield and quality while reducing costs and environmental impact. Citrus‑focused ANN models using weather and heat‑unit data can be embedded in decision systems to guide cultivar choice, harvest planning and logistical arrangements, particularly in regions facing climate variability. At the policy level, AI‑driven yield prediction frameworks offer tools for strategic planning, food security assessment and climate‑change adaptation. IoT‑based systems that integrate climate, weather, yield and chemical data have been proposed as national‑scale tools to anticipate annual crop yields, with ensemble tree models achieving R² above 0.99 in some settings. Systematic reviews underline that robust crop‑yield prediction under abnormal climate requires careful identification and monitoring of key environmental factors (temperature, precipitation, soil moisture, and fertility), which can be translated into indicators for risk assessment and targeted support programs. For citrus‑producing regions, such systems can inform regional zoning, irrigation investment, and incentives for soil‑fertility restoration.

Several research gaps point to directions for improving citrus yield prediction based on soil and climate variables. Deep learning and remote sensing reviews stress the need for larger, open, and well‑annotated datasets, as current work often relies on limited samples and heterogeneous features, which constrain generalization. Future studies should more systematically integrate multi‑source information (high‑resolution soil maps, detailed climate records, and multi‑temporal vegetation indices) and benchmark a range of ML/DL architectures (e.g., CNN, LSTM, DNN) under standardized evaluation protocols. Methodologically, there is a strong case for developing hybrid and explainable frameworks. Recent surveys highlight the promise of ensemble learning and XAI in clarifying the roles of climatic and soil factors, especially under abnormal climate conditions. At the same time, analyses using synthetic datasets warn that sophisticated ML models may add limited value over simple baselines if partitioning and validation do not reflect true forecasting conditions, underscoring the need for fair experimental designs and comparison with “best‑guess” benchmarks. For citrus, advancing domain‑adaptation strategies across regions, improving model interpretability for growers, and incorporating evolving climate scenarios will be key to building robust, transferable soil-climate prediction systems that genuinely support sustainable citrus production.
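One way to make validation reflect true forecasting conditions is to split data by season so that training never uses years later than the test year, and to score a simple baseline on the same splits. The sketch below shows such a forward‑chaining scheme together with a historical‑mean baseline. Treating the historical mean as the “best‑guess” benchmark is an assumption, as are the function names and the yield values.

```python
def forward_chaining_splits(years, min_train_years=3):
    """Yield (train_years, test_year) pairs; training never uses future seasons."""
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]


def historical_mean_baseline(yield_by_year, train_years):
    """Predict the next season as the mean yield of the training seasons."""
    return sum(yield_by_year[y] for y in train_years) / len(train_years)


# Hypothetical regional mean yields in t/ha.
yield_by_year = {2018: 30.0, 2019: 34.0, 2020: 32.0, 2021: 36.0, 2022: 28.0}

splits = list(forward_chaining_splits(yield_by_year))
abs_errors = [
    abs(yield_by_year[test] - historical_mean_baseline(yield_by_year, train))
    for train, test in splits
]
baseline_mae = sum(abs_errors) / len(abs_errors)
print(splits, baseline_mae)
```

Any machine learning model evaluated on the same splits would need to achieve an error clearly below `baseline_mae` to demonstrate added value over the simple benchmark.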

Acknowledgments

I would like to thank the anonymous reviewers for their detailed review of the draft. Their specific feedback helped me correct the logical loopholes in my arguments.

Conflict of Interest Disclosure

The author affirms that this research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Abbas F., Afzaal H., Farooque A., and Tang S., 2020, Crop yield prediction through proximal sensing and machine learning algorithms, Agronomy, 10(7): 1046.

https://doi.org/10.3390/agronomy10071046

Ahmad N., Hussain S., Ali M., Minhas A., Waheed W., Danish S., Fahad S., Ghafoor U., Baig K., Sultan H., Hussain M., Ansari M., Marfo T., and Datta R., 2022, Correlation of soil characteristics and citrus leaf nutrients contents in current scenario of Layyah district, Horticulturae, 8(1): 61.

https://doi.org/10.3390/horticulturae8010061

Almady S., Abdel-Sattar M., Al-Sager S., Al-Hamed S., and Aboukarima A., 2024, Employing an artificial neural network model to predict citrus yield based on climate factors, Agronomy, 14(7): 1548.

https://doi.org/10.3390/agronomy14071548

Ansarifar J., Wang L., and Archontoulis S., 2021, An interaction regression model for crop yield prediction, Scientific Reports, 11: 17207.

https://doi.org/10.1038/s41598-021-97221-7

Anu C., Nirmala C., Bhowmik A., and Santhosh A., 2025, Optimizing crop yield prediction: An in-depth analysis of outlier detection algorithms on Davangere region, The Scientific World Journal, 2025: 9312639.

https://doi.org/10.1155/tswj/9312639

Bazrafshan O., Ehteram M., Moshizi Z., and Jamshidi S., 2022, Evaluation and uncertainty assessment of wheat yield prediction by multilayer perceptron model with bayesian and copula bayesian approaches, Agricultural Water Management, 270: 107881.

https://doi.org/10.1016/j.agwat.2022.107881

Burdett H. and Wellen C., 2022, Statistical and machine learning methods for crop yield prediction in the context of precision agriculture, Precision Agriculture, 23(5): 1553-1574.

https://doi.org/10.1007/s11119-022-09897-0

Călin A., Coroiu A., and Muresan H., 2023, Analysis of preprocessing techniques for missing data in the prediction of sunflower yield in response to the effects of climate change, Applied Sciences, 13(13): 7415.

https://doi.org/10.3390/app13137415

Catalano G., D’Urso P., and Arcidiacono C., 2025, SDM- and GIS-based prediction of citrus suitability in southern Italy: Evaluating the influence of local versus global climate datasets, Land, 14(11): 2223.

https://doi.org/10.3390/land14112223

Chrispell J., Jenkins E., Kavanagh K., and Parno M., 2021, Characterizing prediction uncertainty in agricultural modeling via a coupled statistical-physical framework, Modelling, 2(4): 40.

https://doi.org/10.3390/modelling2040040

Cooper M., Tang T., Gho C., Hart T., Hammer G., and Messina C., 2020, Integrating genetic gain and gap analysis to predict improvements in crop productivity, Crop Science, 60(2): 582-604.

https://doi.org/10.1002/csc2.20109

Díaz I., Mazza S., Combarro E., Giménez L., and Gaiad J., 2017, Machine learning applied to the prediction of citrus production, Spanish Journal of Agricultural Research, 15(2): e0205.

https://doi.org/10.5424/sjar/2017152-9090

Dokoohaki H., Kivi M., Martinez-Feria R., Miguez F., and Hoogenboom G., 2021, A comprehensive uncertainty quantification of large-scale process-based crop modeling frameworks, Environmental Research Letters, 16(10): 104040.

https://doi.org/10.1088/1748-9326/ac0f26

El-Mahroug S., Suleiman A., Zoubi M., Al-Omari S., Abu-Afifeh Q., Al-Jawaldeh H., Alta’any Y., Al-Nawaiseh T., Obeidat N., Alsoud S., Alshoshan A., Al-Shibli F., and Ta’any R., 2025, Predictive modeling of climate-driven crop yield variability using DSSAT towards sustainable agriculture, AgriEngineering, 7(5): 156.

https://doi.org/10.3390/agriengineering7050156

Folberth C., Skalský R., Moltchanova E., Balkovič J., Azevedo L., Obersteiner M., and Van Der Velde M., 2016, Uncertainty in soil data can outweigh climate impact signals in global crop yield simulations, Nature Communications, 7: 11872.

https://doi.org/10.1038/ncomms11872

Gupta S., Geetha A., Sankaran K., Zamani A., Ritonga M., Raj R., Ray S., and Mohammed H., 2022, Machine learning- and feature selection-enabled framework for accurate crop yield prediction, Journal of Food Quality, 2022: 6293985.

https://doi.org/10.1155/2022/6293985

Huang J., Hartemink A., and Kucharik C., 2021, Soil-dependent responses of US crop yields to climate variability and depth to groundwater, Agricultural Systems, 190: 103085.

https://doi.org/10.1016/j.agsy.2021.103085

Iniyan S., Varma V., and Naidu C., 2023, Crop yield prediction using machine learning techniques, Advances in Engineering Software, 175: 103326.

https://doi.org/10.1016/j.advengsoft.2022.103326

Jeong J., Resop J., Mueller N., Fleisher D., Yun K., Butler E., Timlin D., Shim K., Gerber J., Reddy V., and Kim S., 2016, Random forests for global and regional crop yield predictions, PLoS ONE, 11(6): e0156571.

https://doi.org/10.1371/journal.pone.0156571

Khaki S., Wang L., and Archontoulis S., 2019, A CNN-RNN framework for crop yield prediction, Frontiers in Plant Science, 10: 1750.

https://doi.org/10.3389/fpls.2019.01750

Kumar V., Ramesh K., and Rakesh V., 2023, Optimizing LSTM and Bi-LSTM models for crop yield prediction and comparison of their performance with traditional machine learning techniques, Applied Intelligence, 53(24): 28291-28309.

https://doi.org/10.1007/s10489-023-05005-5

Liu Y., Heying E., and Tanumihardjo S., 2012, History, global distribution, and nutritional importance of citrus fruits, Comprehensive Reviews in Food Science and Food Safety, 11(6): 530-545.

https://doi.org/10.1111/j.1541-4337.2012.00201.x

Mahesh P. and Soundrapandiyan R., 2024, Yield prediction for crops by gradient-based algorithms, PLoS ONE, 19: e0291928.

https://doi.org/10.1371/journal.pone.0291928

Moussaid A., Fkihi S., Zennayi Y., Kassou I., Bourzeix F., Lahlou O., Mansouri L., and Imani Y., 2023, Citrus yield prediction using deep learning techniques: A combination of field and satellite data, Journal of Open Innovation: Technology, Market, and Complexity, 9: 100075.

https://doi.org/10.1016/j.joitmc.2023.100075

Paudel D., Boogaard H., Wit A., Janssen S., Osinga S., Pylianidis C., and Athanasiadis I., 2020, Machine learning for large-scale crop yield forecasting, Agricultural Systems, 187: 103016.

https://doi.org/10.1016/j.agsy.2020.103016

Yin X., Leng G., and Yu L., 2022, Disentangling the separate and confounding effects of temperature and precipitation on global maize yield using machine learning, statistical and process crop models, Environmental Research Letters, 17(3): 034015.

https://doi.org/10.1088/1748-9326/ac5716

Zhong G. and Nicolosi E., 2020, Citrus origin, diffusion, and economic importance, in: Citrus Fruit Biology, 1: 5-21.

https://doi.org/10.1007/978-3-030-15308-3_2
